Introduction

As required, this task was an open one, so the students had to choose a specific topic on their own. Our group chose a dataset that we found at https://labrosa.ee.columbia.edu/millionsong/pages/getting-dataset#subset. This subset contains 10,000 song files and is around 2 GB in size. The full dataset is about 300 GB in size and has around 1 million entries, in this case songs. Besides the audio analysis, the dataset includes metadata such as artist and year of production, and finally the music data features for each song in HDF5 format. The original provider of this dataset is The Echo Nest (http://the.echonest.com), which used to be a music intelligence and data platform for developers until Spotify (https://www.spotify.com/de/), a well-known music streaming provider, acquired it. 1 According to the information provided with the dataset, it is the result of a collaboration between The Echo Nest and LabROSA (https://labrosa.ee.columbia.edu). 2

Our goal in this project is to analyze a selection of songs that we like. Since all of the song files are labeled with artist and song names as well as the year of production, almost every song can be found either on YouTube (https://www.youtube.com) or on Spotify (https://www.spotify.com/de/). First, we are going to listen to some of the songs to find the ones that we prefer. Next, we are going to analyze those songs to gain a good understanding of the data that describes our preferences. Finally, we are going to use Spotify for prediction. By restricting the analysis to songs we know, we hope for a better analysis and understanding of the given data. Otherwise we would mostly be comparing unrelated data, which is not suitable for research purposes.

Alongside the analysis above, we also want to obtain some more general information about the artists and their songs. Therefore we are going to visualize some general information as well.

Handle the downloaded data

After downloading and unzipping the data, one can see two folders. The first one, ‘data’, contains several other folders, and the second one, ‘AdditionalFiles’, contains some additional files in either SQL or txt format. The directory structure is based on The Echo Nest track IDs 3. The ‘data’ folder contains exclusively song files in HDF5 (Hierarchical Data Format 5) format. This format is mostly used in scientific applications for big datasets. It originated at the National Center for Supercomputing Applications (NCSA) and was adopted by NASA 4 to handle large, heterogeneous and hierarchical datasets. Each of those files contains some audio analysis, some metadata and some further information that is stored on MusicBrainz (https://musicbrainz.org), an open music encyclopedia. The data available in ‘AdditionalFiles’ is used for a first hands-on exploration of the whole dataset, since it is simple to access. By doing so we obtain some general information about the dataset. To read both data folders one has to install some additional packages, which are mentioned later on.

For more information about the dataset, especially the frequently asked questions, we recommend visiting https://labrosa.ee.columbia.edu/millionsong/faq.
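The directory layout described in footnote 3 means that the location of a song file can be derived from its track ID alone. The following is a minimal sketch of that idea; the helper name track_path and the root folder are assumptions made for illustration and are not part of the dataset tooling.

```r
# Hypothetical helper: build the expected .h5 path for a track ID.
# Assumes the layout <root>/data/<3rd char>/<4th char>/<5th char>/<ID>.h5
track_path <- function(track_id, root = "MillionSong") {
  file.path(root, "data",
            substr(track_id, 3, 3),
            substr(track_id, 4, 4),
            substr(track_id, 5, 5),
            paste0(track_id, ".h5"))
}

# Track ID taken from the example in footnote 3
example_path <- track_path("TRADHRX12903CD3866")
print(example_path)
```

Building the path directly avoids a recursive search through the whole folder tree, which can be slow for the full dataset.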

Preprocess the Additional files

When accessing the data provided in the ‘AdditionalFiles’ folder, one has to replace the separator <SEP> with a common single-character separator such as ‘;’. This is necessary because R's read functions expect a one-byte separator, so a file with a separator like <SEP> cannot be read directly.
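The replacement itself can be done in R. The following is a minimal sketch under stated assumptions: the helper name replace_sep and the demo line are hypothetical, and the approach only works cleanly if the fields do not already contain a ‘;’.

```r
# Hypothetical helper: rewrite a file, replacing "<SEP>" with ";"
replace_sep <- function(infile, outfile, old = "<SEP>", new = ";") {
  lines <- readLines(infile, encoding = "UTF-8", warn = FALSE)
  # fixed = TRUE treats "<SEP>" as a literal string, not a regular expression
  writeLines(gsub(old, new, lines, fixed = TRUE), outfile)
  invisible(outfile)
}

# Demo on a temporary file with one invented line
tmp_in  <- tempfile(fileext = ".txt")
tmp_out <- tempfile(fileext = ".txt")
writeLines("AR1<SEP>X1<SEP>TR1<SEP>Some Artist", tmp_in)
replace_sep(tmp_in, tmp_out)

# quote = "" keeps quotation marks inside names from being read as delimiters
demo <- read.csv2(tmp_out, sep = ";", header = FALSE, quote = "")
print(demo)
```

Writing to a separate output file keeps the original download untouched, so the step can be repeated if another separator turns out to be a better choice.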

The following code chunk was only used to access the preprocessed txt files in RStudio.

# Load preprocessed data and name the columns
location <- read.csv2('data/subset_artist_location.txt',sep = ';', header = FALSE, col.names = c('artistId', 'lat','lon',  'trackID', 'artistName'))
artists <- read.csv2('data/subset_unique_artists.txt',sep = ';', header = FALSE, col.names = c('artistId', 'V2', 'trackID', 'artistName'))
tags <- read.csv2('data/subset_unique_mbtags.txt',sep = ';', header = FALSE, col.names = c('tags'))
uni_terms <- read.csv2('data/subset_unique_terms.txt',sep = ';', header = FALSE, col.names = c('terms Unique' ))
tracks <- read.csv2('data/subset_unique_tracks.txt',sep = ';', header = FALSE, col.names = c('trackID','V2', 'artistName','songName'))
tracksPerYear <- read.csv2('data/subset_tracks_per_year.txt',sep = ';', header = FALSE,  col.names = c('Year', 'trackID', 'artistName','songName'))

General information visualized

The following code loads the packages that are required to create a wordcloud. When creating a wordcloud, one will notice that the first attempt has a very poor distribution, mostly because of the most common words in the English language. Those words carry no meaning for this purpose. Therefore, according to this observation and a Wikipedia article 5, one should remove these words from the dataset. The recommendation is to remove ‘the’, ‘and’ and ‘a’.

Before turning to the code, it is important to understand what each package is used for. We start with ‘tm’ (Text Mining Package), which is commonly used for wordclouds and for handling strings. First, one should take a closer look at Corpus, which creates a corpus, i.e. a collection of text documents 6. Second, a VectorSource has to be created as input for the Corpus function, and finally there is tm_map, an interface that applies transformation functions to corpora. Another very important function, content_transformer, creates a wrapper to get and set the content of a document. These steps were used to preprocess the documents. Afterwards one should also create a term-document matrix, which records every term and the documents in which it appears.

The package ‘wordcloud’ is very useful and provides a graphical representation of the frequencies of the words used in one or more documents 7. These wordclouds can be seen in the following plots.

Visualize artist names

# Load packages
# library("NLP")
library("tm") # for text mining
# library("SnowballC") # for text stemming
library("RColorBrewer") # color palettes
library("wordcloud") # word-cloud generator 

docs <- Corpus(VectorSource(as.String(artists$artistName)))

# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))

# most common English words that carry no meaning for this purpose
others <- c('the','and','a')

# replace the listed patterns with a space
# note: gsub matches substrings, so 'a' is also removed inside words
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
for (i in seq_along(others)){
  docs <- tm_map(docs, toSpace, others[i])
}

# calculate the frequency of each word
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)

wordcloud(words = d$word, freq = d$freq, min.freq = 1, scale = c(3,0.2),
          max.words=200, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2"))
mtext('Artistnames', side = 2, line = 1, adj = 0.5) # title

Looking at the wordcloud above, one can see that the most common words in artist names are ‘orchestr’ and ‘john’. Because the cleaning step replaces the patterns anywhere in the text, the letter ‘a’ is also removed inside words, so ‘orchestr’ stands for ‘orchestra’. There are also some Spanish artist names containing words like ‘los’. This helps to get to know the dataset better: it becomes clear that the artists do not only come from England, the rest of Europe or the US, but also from Spain or Latin America. One could of course remove the articles and prepositions of all languages the dataset contains, in which case the article ‘los’ would also be removed. Some other names like Joe or King are also quite common. To support further assumptions and to get a better understanding of the wordcloud, the actual frequencies of the most frequent entries are provided in a table.

# show only head of frequency dataFrame
head(d,8)
##              word freq
## john         john   41
## orchestr orchestr   38
## los           los   31
## vid           vid   31
## turing     turing   25
## joe           joe   21
## bro           bro   19
## king         king   19
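Entries such as ‘vid’ and ‘turing’ in this table are word fragments, because gsub replaces a pattern wherever it occurs. A minimal sketch of whole-word removal with base R follows; the input names are invented for illustration and are not taken from the dataset. The tm function removeWords, applied through tm_map, would be another way to remove whole words only.

```r
# Invented artist names, for illustration only
names_raw <- c("The London Orchestra", "David and the Band", "A Flock of Seagulls")
stopwords_small <- c("the", "and", "a")

# \\b marks a word boundary, so only whole words are matched
pattern <- paste0("\\b(", paste(stopwords_small, collapse = "|"), ")\\b")
cleaned <- gsub(pattern, " ", tolower(names_raw), perl = TRUE)

# collapse the leftover runs of spaces and trim the ends
cleaned <- trimws(gsub("\\s+", " ", cleaned, perl = TRUE))
print(cleaned)
```

With this pattern ‘orchestra’ and ‘david’ stay intact, while the standalone words ‘the’, ‘and’ and ‘a’ are removed.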

Together, this table and the wordcloud give a better understanding of the distribution of the artist names in the given dataset. It is now interesting to find out more about the most common name, John. After a little research on the internet [research] one can see that John was one of the most common names in the 1990s. To check whether this name occurs mostly in the 1990s in the dataset, one should take a closer look at those years.

tracksPerYear$artistName[tracksPerYear$Year >= 1990 & tracksPerYear$Year <= 2000]
##  [1] K's Choice            K's Choice            Kaija Koo            
##  [4] Kisha                 Lee Ritenour          Les Malpolis         
##  [7] Lisa Lynne            Los Amigos Invisibles Los Amigos Invisibles
## [10] Luciana Souza         M.A. Numminen         Mandi                
## [13] Martin Sexton         Martin Sexton         Mithotyn             
## [16] Mithotyn              Monster Magnet        Moonspell            
## [19] Mudhoney              Natural Elements      Nic Endo             
## [22] Old Man's Child       OutKast              
## 1149 Levels: !!! 2 Minutos 2-4 Grooves feat. Reki D. ... Zombina & The Skeletones

After displaying the artist names in the dataset for the years 1990 to 2000, the assumption made before has to be rejected. However, one can see another common word in the displayed subset: ‘Los’. This set needs to be described and explored further, because the exploration so far does not provide much information.
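Instead of reading the printed names by eye, the share of names containing ‘john’ inside and outside the 1990s could be compared directly. The following is a minimal sketch; the data frame tracks_demo is a hypothetical stand-in with the same column names as tracksPerYear, and its values are invented.

```r
# Hypothetical stand-in for tracksPerYear (invented values)
tracks_demo <- data.frame(
  Year = c(1985, 1992, 1995, 1999, 2004),
  artistName = c("John Doe", "Los Amigos", "Johnny Example",
                 "Elton John", "John Smith"),
  stringsAsFactors = FALSE
)

# rows that fall into the period of interest
in_90s <- tracks_demo$Year >= 1990 & tracks_demo$Year <= 2000

# whole-word match, so "Johnny" does not count as "John"
has_john <- grepl("\\bjohn\\b", tolower(tracks_demo$artistName), perl = TRUE)

share_90s   <- mean(has_john[in_90s])
share_other <- mean(has_john[!in_90s])
print(c(share_90s = share_90s, share_other = share_other))
```

Comparing two shares makes the check independent of how many tracks fall into each period.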

Visualize song names

Almost the same analysis was done for common song names. However, the common words in this case were not quite the same as in the script before. To find common song names, the wordcloud was first plotted in an uncleaned version containing all words. After deciding which words carry no meaning for the final statement, those words were removed. Cleaning with the words ‘the’, ‘version’, ‘and’, ‘from’, ‘feat’ and ‘album’ produced the following wordcloud.

# Load packages
#library("NLP")
#library("tm") # for text mining
#library("SnowballC") # for text stemming
#library("RColorBrewer") # color palettes
#library("wordcloud") # word-cloud generator 

docs1 <- Corpus(VectorSource(as.character(tracks$songName)))

# Convert the text to lower case
docs1 <- tm_map(docs1, content_transformer(tolower))

others <- c('the','version','and','from', 'feat','album')
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
for (i in 1:length(others)){
docs1 <- tm_map(docs1, toSpace, others[i])
}

dtm <- TermDocumentMatrix(docs1)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)

wordcloud(words = d$word, freq = d$freq, min.freq = 1, 
          max.words=100, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2"))
mtext('Songnames', side = 3, line = 0, adj = 0.5) # title

Looking at the result, one can see the frequent words ‘you’ and ‘love’. This suggests that the dataset contains many song names that deal with love and with the person being addressed, ‘you’. A general assumption could be that there are more songs about love, about a counterpart and about life than, for example, about technology or traveling; note, however, that ‘live’ may also simply mark live recordings. This assumption cannot be completely proven, since this dataset does not represent all the song names in the world.

The following table gives a more detailed view of the distribution of the words in the song names.

head(d,7)
##      word freq
## you   you  540
## love love  332
## live live  216
## for   for  185
## all   all  144
## your your  143
## don   don  137

Visualize artist locations

Since it is clear that the dataset does not only contain artists from England, the rest of Europe or the US, it would be nice to have a map of the world showing the locations of the artists. This can be achieved with the package ‘maps’ 8. This package can draw a map not only by referring to it by a name such as ‘world’, but also by passing limits for longitude and latitude in order to take a closer look at particular regions. It makes it easy to draw complex maps as well as to place points on a map. The world map below shows all artists with their locations. Unfortunately, the dataset does not provide a location for every artist it contains, but the map nevertheless gives a good overview.

library(maps)
#library(mapdata)
#library(eurostat)

# parse the lat and lon values of given set 
lon <- as.double(as.character(location$lon))
lat <- as.double(as.character(location$lat))

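# keep only rows where both coordinates are available,
# so that lon and lat stay paired
complete <- !is.na(lon) & !is.na(lat)
lon <- lon[complete]
lat <- lat[complete]

coordinates <- as.data.frame(cbind(lon, lat))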

# take a closer look at europe 
#europe <- as.data.frame(cbind(lon = c(54.78333, 24.08464, -31.26192, 59.34569), lat = c(80.56667, 34.83469, 39.45479, 62.21215)))

map('world',c('.'), col = "grey80", fill = TRUE, border = "grey40") 
points(coordinates$lon, coordinates$lat, col = "red", cex = .1)

#x <- map('world', xlim = range(europe$lon), ylim = range(europe$lat), namefield = TRUE)
#x$names <- gsub("\\:.*","",x$names)

The assumptions about the artists made beforehand are correct. The data does not only consist of European and English-speaking artists but also of artists from around the world. Most artists come from the US and Europe, some from Russia or Australia or, as assumed before, from Latin America. From the song names alone it is hard to tell whether an artist is from Australia or the US. With this representation one gains a better understanding of the dataset and can settle the questions that were left open.

The map below provides an even closer look at Europe. It was created because Europe is hard to read in the world map. This region was chosen in particular because of the location of the author and of the university.

map(col = "grey80", border = "grey40", fill = TRUE,
  xlim = c(-25, 45), ylim = c(36, 70), mar = rep(0.1, 4))
points(coordinates$lon, coordinates$lat, col = "red", cex = .3)
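To put a number on what the Europe map shows, one could count the points that fall inside the same longitude and latitude limits. The following is a minimal sketch; coords_demo is a hypothetical stand-in for the coordinates data frame and its values are invented.

```r
# Hypothetical coordinates standing in for the 'coordinates' data frame
coords_demo <- data.frame(
  lon = c(13.4, -0.1, -74.0, 151.2, 2.35),
  lat = c(52.5, 51.5, 40.7, -33.9, 48.86)
)

# same limits as in the Europe map call
xlim <- c(-25, 45)
ylim <- c(36, 70)

# TRUE for every point inside the bounding box
in_europe <- coords_demo$lon >= xlim[1] & coords_demo$lon <= xlim[2] &
             coords_demo$lat >= ylim[1] & coords_demo$lat <= ylim[2]

n_europe     <- sum(in_europe)
share_europe <- mean(in_europe)
print(c(n_europe = n_europe, share_europe = share_europe))
```

A bounding box is only a rough approximation of Europe, since it also covers parts of North Africa and Western Asia.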

#source("http://bioconductor.org/biocLite.R")
#biocLite("rhdf5")
library(rhdf5) # required for H5 files

# set a hardcoded Path to the MillionSongSubset
pathToSet = '/Users/Kostja/Desktop/Master/Sem 2 (18 SoSe)/Data Visualization/Tasks/MillionSongSubset'

# create an array with the IDs, looked up beforehand, of the preferred songs
TrackIDs <- array(c('TRAPZTV128F92CAA4E','TRANNZZ128F92C22F7','TRAQZQX128F931338F','TRALONM128EF35A199','TRAWBHE12903CBC4CB'))

# automatically find all paths whose names contain the track IDs
SubPaths <- lapply(TrackIDs,function(x){
  list.files(pathToSet, x, recursive=TRUE, full.names=TRUE, include.dirs=TRUE)
})

# beautify the dataset 
SubPaths <- data.frame(SubPaths = t(unlist(SubPaths)))
names(SubPaths) <- c('beyonce', 'justin', 'kanye', 'madonna', 'bruno')



# read the H5 files and create a readable output
artist <- lapply(SubPaths, function(x){
  h5ls(toString(x))
})



Analyze_song <- apply(SubPaths,2,function(x){
  h5read(x,"/analysis/songs")
})
Analyze_song <- do.call(rbind, Analyze_song)

Meta_song <- apply(SubPaths,2,function(x){
  h5read(x,"/metadata/songs")
})
Meta_song <- do.call(rbind, Meta_song)

library(fmsb)

radarFrame <- function(df1, df2){
  # df1: song metadata, df2: song analysis; one row per song
  features <- cbind('artist_familiarity' = df1$artist_familiarity, 'artist_hotttnesss' = df1$artist_hotttnesss, 'tempo'= df2$tempo, 'time_signature' = df2$time_signature, 'loudness' = df2$loudness, 'key' = df2$key) 
  rownames(features) <- rownames(df1)
  data.frame(features)
}

namesLegend <- paste(Meta_song$artist_name,Meta_song$title)

radar <- function(df, namesLeg = namesLegend, x = -2.8 , y= -1.1){
  transparency <- adjustcolor(1:dim(df)[1], alpha.f = 0.2) 
  # customize the radar chart
  radarchart( df  , axistype=1 , maxmin = FALSE,
    # polygon settings
    pcol=1:dim(df)[1], plwd=1 , pfcol = transparency ,
    # grid settings
    cglcol="grey", cglty=1, axislabcol=FALSE ,
    # label settings
    vlcex=0.8 
    )
  # allow the legend to be drawn outside the plot region
  par(xpd=TRUE)
  legend(x,y, legend = namesLeg, bty = "n", pch=20 , col=1:dim(df)[1] , cex=0.8, pt.cex=2)
}

data <- radarFrame(Meta_song, Analyze_song)

radar(data)

# features to look at for the radar chart
# artist familiarity: under metadata
# the hotttnesss values are estimates, i.e. computed by The Echo Nest, and hard to interpret in absolute terms
# tempo (in songs): compare with another site because it is not quite correct
# time signature (in songs): also compare with another site; both come from the same dataset and therefore share the same error. If another dataset is added, the error may no longer be reproducible and the bias may be completely distorted, so that we can no longer draw any conclusion.
# loudness in songs
# key in songs

# for everything above, take data from another site and create a radar plot for comparison

# loudness max as a more detailed value
compareFrame <- data.frame(rbind(
  beyonce = c('familiarity' = 70, 'tempo' = 97, 'time_signature' = 4, 'loudness' = -5,'key' = 1),
  justin = c('familiarity' = 70, 'tempo' = 76, 'time_signature' = 4, 'loudness' = -5,'key' = 7),
  kanye = c('familiarity' = 65, 'tempo' = 106, 'time_signature' = 4, 'loudness' = -5,'key' = 9),
  madonna = c('familiarity' = 54, 'tempo' = 119, 'time_signature' = 4, 'loudness' = -7,'key' = 9),
  bruno = c('familiarity' = 70, 'tempo' = 104, 'time_signature' = 4, 'loudness' = -6,'key' = 10)
))

# because all time signatures are 4, there is no proper graph 
# radarchart scales each axis relative to the data
radar(compareFrame)

# not really comparable, as can be seen
par(mfrow = c(1,2))
radar(data,x=-2.2, y = -1.2)
radar(compareFrame,x=-2.2)

par(mfrow = c(1,1))
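One reason the two charts are hard to compare is that, with maxmin = FALSE, each axis is scaled to the minimum and maximum of the data it is given. With maxmin = TRUE, radarchart instead reads the axis maximum from row 1 and the axis minimum from row 2 of the data frame. The following is a minimal sketch of preparing such a frame; the helper add_maxmin, the padding value and the demo values are assumptions made for illustration.

```r
# Hypothetical helper: prepend the max and min rows that radarchart
# expects when maxmin = TRUE; constant columns get an artificial range
add_maxmin <- function(df, pad = 1) {
  hi <- sapply(df, max)
  lo <- sapply(df, min)
  constant <- hi == lo
  hi[constant] <- hi[constant] + pad
  lo[constant] <- lo[constant] - pad
  limits <- as.data.frame(rbind(max = hi, min = lo))
  rbind(limits, df)
}

# Invented demo values; time_signature is constant on purpose
demo <- data.frame(tempo = c(97, 76, 106),
                   time_signature = c(4, 4, 4),
                   row.names = c("song_a", "song_b", "song_c"))
scaled <- add_maxmin(demo)
print(scaled)

# With fmsb loaded, radarchart(scaled, maxmin = TRUE) would then use
# rows 1 and 2 as the axis limits.
```

Passing the same limits to both charts would put them on a common scale, which is what a side-by-side comparison needs.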

Track IDs used: beyonce TRAPZTV128F92CAA4E, justin TRANNZZ128F92C22F7, kanye TRAQZQX128F931338F, madonna TRALONM128EF35A199, bruno mars TRAWBHE12903CBC4CB

# library(fmsb)
# Tune_Beyance
# Tune_Justin <- c(,,76,,-5,8)
# Tune_Kanye
# Tune_Bruno
# Tune_Madonna

loudness_start <- apply(SubPaths,2,function(x){
  h5read(x,"/analysis/segments_loudness_start")
})

loudness_max <- apply(SubPaths,2,function(x){
  h5read(x,"/analysis/segments_loudness_max")
})

par(mfrow= c(1,2))
boxplot(loudness_start, main = 'loudness_start' )
boxplot(loudness_max, main = 'loudness_max' )
mtext('Boxplots of loudness', outer = TRUE, side = 3, line = -1)

par(mfrow= c(1,1))


Draw_matrix_plots <- function(plt){
  # 3 plots in the first row, 2 centered plots in the second row
  layout(matrix(c(1,1,2,2,3,3,0,4,4,5,5,0), 2, byrow = TRUE), heights=c(2,2))
  for (i in seq_along(plt)){
    plot(plt[[i]], type = 'l', axes = FALSE, xlab = '', ylab = '', main = names(plt)[i])
    axis(2)
    axis(1)
  }
  mtext(paste('Plot', deparse(substitute(plt)),'for different artists' ), side = 3, line = -19, outer = TRUE)
  par(mfrow=c(1,1))
}

Draw_matrix_plots(loudness_start)

Draw_matrix_plots(loudness_max)

matplot_Draw <- function(plt){
  dFrame <- do.call(cbind, plt)
  matplot(dFrame,type = "l", col = 1:dim(dFrame)[2], ylab = "loudness", xlab = 'segmentstep', main = paste('matplot', deparse(substitute(plt))))
  legend("topleft", legend = names(plt), col = 1:dim(dFrame)[2], pch = 16)
}

matplot_Draw(loudness_start)
## Warning in (function (..., deparse.level = 1) : number of rows of result is
## not a multiple of vector length (arg 2)

matplot_Draw(loudness_max)
## Warning in (function (..., deparse.level = 1) : number of rows of result is
## not a multiple of vector length (arg 2)
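The warnings above appear because the songs have different numbers of segments, so cbind recycles the shorter vectors and the plotted lines repeat values at the end. A minimal sketch of one way around this is to pad the shorter vectors with NA before combining them; the helper pad_to_matrix and the demo list are assumptions made for illustration.

```r
# Hypothetical helper: pad every vector with NA to the longest length
# and combine the result into a matrix, one column per list element
pad_to_matrix <- function(lst) {
  n <- max(lengths(lst))
  sapply(lst, function(v) c(v, rep(NA_real_, n - length(v))))
}

# Invented demo values with unequal lengths
demo <- list(a = c(1, 2, 3), b = c(4, 5))
m <- pad_to_matrix(demo)
print(m)

# matplot(m, type = "l") would then end the shorter line
# where its data ends instead of repeating values
```

Since matplot skips NA values, each song would be drawn only over its own number of segments.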

# not sure about this one
Analyze_pitch <- apply(SubPaths,2,function(x){
  h5read(x,"/analysis/segments_pitches")
})
boxplot(Analyze_pitch)

Analyze_timbre <- apply(SubPaths,2,function(x){
  h5read(x,"/analysis/segments_timbre")
})

boxplot(Analyze_timbre)
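The boxplots above pool all values of a song into one box. A more detailed view could draw one box per pitch class. The following is a minimal sketch under stated assumptions: the data is random, the matrix is assumed to have 12 rows (one per pitch class) and one column per segment, and the labels C to B assume the usual order of pitch classes, which should be checked against the track description linked in the conclusion.

```r
set.seed(1)

# Hypothetical stand-in for one song:
# 12 pitch classes (rows) x 50 segments (columns), random values in [0, 1]
pitches_demo <- matrix(runif(12 * 50), nrow = 12)
rownames(pitches_demo) <- c("C", "C#", "D", "D#", "E", "F",
                            "F#", "G", "G#", "A", "A#", "B")

# transpose so that each pitch class becomes one column, i.e. one box
pitch_by_class <- as.data.frame(t(pitches_demo))
mean_per_class <- colMeans(pitch_by_class)
print(round(mean_per_class, 2))

# boxplot(pitch_by_class, las = 2, main = "Pitch classes (demo)")
```

The same reshaping would apply to the timbre values, which are also stored as one vector per segment.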

Conclusion

The H5 data explained: https://labrosa.ee.columbia.edu/millionsong/pages/example-track-description

Sources

European map limits: http://www.milanor.net/blog/maps-in-r-introduction-drawing-the-map-of-europe/

Comparison sites: http://www.findsongtempo.com and http://www.tunebat.com


  1. https://en.wikipedia.org/wiki/The_Echo_Nest page view [02.07.18]

  2. https://labrosa.ee.columbia.edu/millionsong/ page view [02.07.18]

  3. TR + LETTERS + LETTERS&NUMBERS, so the directory path within the dataset is built from the third, fourth and fifth characters of the track ID, e.g. ‘MillionSong/data/A/D/H/TRADHRX12903CD3866.h5’

  4. (National Aeronautics and Space Administration) https://www.nasa.gov/about/index.html page view [02.07.18]

  5. https://en.wikipedia.org/wiki/Most_common_words_in_English page view [26.06.18]

  6. https://cran.r-project.org/web/packages/tm/tm.pdf page view [26.06.18]

  7. https://cran.r-project.org/web/packages/wordcloud/wordcloud.pdf page view [27.06.18]

  8. https://cran.r-project.org/web/packages/maps/maps.pdf page view [25.06.18]